Lab2 tutorial

Yue You

2021-07-23

Lab 2 Overview

First Half

More Visualisations:

  • Heatmaps

  • HexBins vs SquareBins

  • Choropleths

Descriptive Statistics:

Second Half

Revision:

  • Any code related questions for R

Let’s get started

Visualisations

  • Binned plots

  • Good visualizations vs Bad visualizastions

Personal Checklist for Visualisations and Dashboards:

  • Your visualisation needs to tell a story.

  • It should be intepretable without being overly verbose.

  • The scale and axis need to make sense (and you can assume the reader knows the difference between a normal scale vs log scale).

  • The choice of visualisation needs to make sense:

  1. Line plot vs Bar chart with non-numerical categories
  2. Scatterplot vs Histogram plot to see distribution
  3. etc
  • Choice of colour scheme / alpha / size need to be easy on the eyes.

At the end of the day, even if you think your visualisation is “pretty” or “beautiful”, if a reader cannot understand it, then it is not a good visualisation.

Load libraries

library(dplyr)
library(sf)
library(curl)
library(ggmap)
library(tmap)
library(tmaptools)
library(feather)

Load data

#try feather 
filepath = "../data/df.feather"
df <- read_feather(filepath)
df %>% tail
## # A tibble: 6 x 19
##   VendorID tpep_pickup_dateti… tpep_dropoff_datet… passenger_count trip_distance
##      <int> <fct>               <fct>                         <int>         <dbl>
## 1        2 4/12/15 22:55       4/12/15 23:28                     2         12.8 
## 2        2 4/12/15 22:55       4/12/15 23:03                     1          0.75
## 3        1 4/12/15 22:55       4/12/15 23:08                     1          2.4 
## 4        1 4/12/15 22:55       4/12/15 23:01                     1          0.8 
## 5        2 4/12/15 22:55       4/12/15 23:17                     1          4.73
## 6        2 4/12/15 22:55       4/12/15 22:59                     2          0.8 
## # … with 14 more variables: pickup_longitude <dbl>, pickup_latitude <dbl>,
## #   RatecodeID <int>, store_and_fwd_flag <fct>, dropoff_longitude <dbl>,
## #   dropoff_latitude <dbl>, payment_type <int>, fare_amount <dbl>, extra <dbl>,
## #   mta_tax <dbl>, tip_amount <dbl>, tolls_amount <dbl>,
## #   improvement_surcharge <dbl>, total_amount <dbl>

Heatmaps

  • Excellent tool when visualising geospatial coordinates.

Select the map.

xranges <- range(df$pickup_longitude[!df$pickup_longitude==0])
yranges <- range(df$pickup_latitude[!df$pickup_latitude==0])
xranges
## [1] -77.04710 -71.06483
yranges
## [1] 37.27044 42.73614
map_big <- get_stamenmap(
  rbind(-74.3,40.45,-73.75,40.95), 
  zoom = 10)
ggmap(map_big)

Plot pickup locations

ggmap(map_big) + 
  geom_bin2d(data = df,
             aes(x = pickup_longitude,
                 y = pickup_latitude),
             bins=50) 

Change the color scale.

ggmap(map_big) + 
  stat_bin2d(data = df,
             aes(x = pickup_longitude,
                 y = pickup_latitude),
             bins=50) +
  scale_fill_viridis_c(trans = "log10")

ggmap(map_big) + 
  stat_bin2d(data = df,
             aes(x = pickup_longitude,
                 y = pickup_latitude),
             bins=50) +
  scale_fill_gradientn(colours = c("darkred", "orange", "yellow", "white"),trans="log10")

Filled by discrete factors.

ggmap(map_big) + 
  stat_bin2d(data = df,
             aes(x = pickup_longitude,
                 y = pickup_latitude,
                 fill = as.factor(passenger_count)),
             bins=50) 

Change the color palette. A good palette makes your story better.

library(RColorBrewer)
display.brewer.all()

ggmap(map_big) + 
  stat_bin2d(data = df,
             aes(x = pickup_longitude,
                 y = pickup_latitude,
                 fill = as.factor(payment_type)),
             bins=100) +
  scale_fill_brewer(palette = "Dark2")

ggmap(map_big) + 
  stat_bin2d(data = df,
             aes(x = pickup_longitude,
                 y = pickup_latitude,
                 fill = as.factor(payment_type),
                 alpha = 0.95),
             bins=100) +
  scale_fill_brewer(palette = "Dark2")

Hexagon bins avoid the visual artefacts sometimes generated by the very regular alignment of geom_bin2d().

Plot HexBin.

ggmap(map_big) + 
  stat_binhex(data=df,
           aes(x = pickup_longitude, y = pickup_latitude),
               bins=100) +
  coord_cartesian()  +
  scale_fill_gradient2(trans="log10")

ggmap(map_big) + 
  stat_binhex(data=df,
           aes(x = pickup_longitude, y = pickup_latitude),
           alpha = 0.9,
           bins=100) +
  coord_cartesian()  +
  scale_fill_viridis_c(trans = "log10") +
  facet_grid(.~payment_type)

Shapefile Overlays

  • NOTE: This only applies on the more recent datasets that use zones over coordinates

Requirements:

  • ggplot2/ ggmap/ tmap/ leaflet

Shapefile Links:

taxi_zones.zip taxi+_zone_lookup.csv

df_yellow = read.csv("../data/yellow_tripdata_2019-01.csv",stringsAsFactors = TRUE)
df_yellow %>% 
  group_by(PULocationID, passenger_count) %>% 
  summarise(n = sum(passenger_count)) -> df_yellow_summary
## `summarise()` has grouped output by 'PULocationID'. You can override using the `.groups` argument.
library(sf)
sf = st_read("../data/shapefile/taxi_zones/taxi_zones.shp")
## Reading layer `taxi_zones' from data source 
##   `/mnt/student.unimelb.edu.au/yueyou/MAST30034_R/data/shapefile/taxi_zones/taxi_zones.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 263 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
## Projected CRS: NAD83 / New York Long Island (ftUS)
zone = read.csv("../data/shapefile/taxi+_zone_lookup.csv")

Basic information in shapefile.

sf = st_read("../data/shapefile/taxi_zones/")
## Reading layer `taxi_zones' from data source 
##   `/mnt/student.unimelb.edu.au/yueyou/MAST30034_R/data/shapefile/taxi_zones' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 263 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
## Projected CRS: NAD83 / New York Long Island (ftUS)
plot(sf)

Choropleth mapping with ggplot2

ggplot() + 
  geom_sf(data = sf, size = 1, color = "black", fill = "cyan1") + 
  ggtitle("Boundary Plot") + 
  coord_sf()+
  theme_bw()

Adding basemaps with ggmap

ggmap(map_big) +
    geom_sf(data = sf, aes(fill=Shape_Area), inherit.aes = FALSE)  +
  coord_sf(crs = st_crs(4326)) +
  scale_fill_viridis_c()

Choropleth with tmap

tmap is specifically designed to make creation of thematic maps more convenient. It borrows from thw ggplot syntax and takes care of a lot of the styling and aesthetics. This reduces our amount of code significantly. We only need:

  • tm_shape() where we provide the sf object (we could also provide an SpatialPolygonsDataframe)

  • tm_polygons() where we set the attribute variable to map, the break style, and a title.

tm_shape(sf)+
   tm_polygons("Shape_Area",style="quantile")

Web mapping with leaflet

library(leaflet) 

Incorporate the information we have.

df_yellow_summary$LocationID <- df_yellow_summary$PULocationID
full_join(sf,df_yellow_summary,by="LocationID") -> sf_full

While tmap was tolerant about our AEA projection of philly_crimes_sf, leaflet does require us to explicitly reproject the sf object.

# reproject
sf_WGS84 <- st_transform(sf_full, 4326)
leaflet(sf_WGS84) %>%
  addPolygons()
sf_WGS84 %>% filter(!is.na(n)) ->sf_plot
pal_fun <- colorQuantile("YlOrRd", NULL, n = 5)
p_popup <- paste0("<strong>Passanger number: </strong>", sf_plot$n)
leaflet(sf_plot) %>%
  addPolygons(
    stroke = FALSE, 
    fillColor = ~pal_fun(n),
    fillOpacity = 0.8, smoothFactor = 0.5,
    popup = p_popup) %>%
  addTiles() %>%
  addLegend("bottomright",  # location
            pal=pal_fun,    # palette function
            values=~n,  # value to be passed to palette function
            title = 'Passanger number') # legend title

Read more about Reprojecting geographic data.

Descriptive Statistics

filepath = "../data/df.feather"
df <- read_feather(filepath)
summary(df)
##     VendorID        tpep_pickup_datetime   tpep_dropoff_datetime
##  Min.   :1.000   4/12/15 19:14:  508     4/12/15 21:00:  487    
##  1st Qu.:1.000   4/12/15 22:44:  506     4/12/15 21:40:  486    
##  Median :2.000   4/12/15 21:57:  503     4/12/15 20:14:  484    
##  Mean   :1.528   4/12/15 22:46:  491     4/12/15 20:58:  480    
##  3rd Qu.:2.000   4/12/15 19:39:  489     4/12/15 21:01:  480    
##  Max.   :2.000   4/12/15 19:28:  486     4/12/15 20:31:  476    
##                  (Other)      :97017     (Other)      :97107    
##  passenger_count trip_distance    pickup_longitude pickup_latitude
##  Min.   :0.000   Min.   : 0.000   Min.   :-77.05   Min.   : 0.00  
##  1st Qu.:1.000   1st Qu.: 1.030   1st Qu.:-73.99   1st Qu.:40.73  
##  Median :1.000   Median : 1.710   Median :-73.98   Median :40.75  
##  Mean   :1.704   Mean   : 2.827   Mean   :-72.88   Mean   :40.14  
##  3rd Qu.:2.000   3rd Qu.: 3.100   3rd Qu.:-73.97   3rd Qu.:40.77  
##  Max.   :6.000   Max.   :91.200   Max.   :  0.00   Max.   :42.74  
##                                                                   
##    RatecodeID     store_and_fwd_flag dropoff_longitude dropoff_latitude
##  Min.   : 1.000   N:99421            Min.   :-74.62    Min.   : 0.00   
##  1st Qu.: 1.000   Y:  579            1st Qu.:-73.99    1st Qu.:40.73   
##  Median : 1.000                      Median :-73.98    Median :40.75   
##  Mean   : 1.029                      Mean   :-72.93    Mean   :40.17   
##  3rd Qu.: 1.000                      3rd Qu.:-73.96    3rd Qu.:40.77   
##  Max.   :99.000                      Max.   :  0.00    Max.   :41.49   
##                                                                        
##   payment_type   fare_amount          extra            mta_tax       
##  Min.   :1.00   Min.   :-120.00   Min.   :-1.0000   Min.   :-0.5000  
##  1st Qu.:1.00   1st Qu.:   7.00   1st Qu.: 0.5000   1st Qu.: 0.5000  
##  Median :1.00   Median :  10.00   Median : 0.5000   Median : 0.5000  
##  Mean   :1.34   Mean   :  12.75   Mean   : 0.6049   Mean   : 0.4985  
##  3rd Qu.:2.00   3rd Qu.:  15.00   3rd Qu.: 0.5000   3rd Qu.: 0.5000  
##  Max.   :4.00   Max.   : 500.00   Max.   : 1.5000   Max.   : 2.5000  
##                                                                      
##    tip_amount      tolls_amount     improvement_surcharge  total_amount    
##  Min.   :  0.00   Min.   : 0.0000   Min.   :-0.3000       Min.   :-120.30  
##  1st Qu.:  0.00   1st Qu.: 0.0000   1st Qu.: 0.3000       1st Qu.:   9.30  
##  Median :  1.46   Median : 0.0000   Median : 0.3000       Median :  12.80  
##  Mean   :  1.80   Mean   : 0.2386   Mean   : 0.2997       Mean   :  16.19  
##  3rd Qu.:  2.55   3rd Qu.: 0.0000   3rd Qu.: 0.3000       3rd Qu.:  18.36  
##  Max.   :115.00   Max.   :24.0000   Max.   : 0.3000       Max.   : 550.30  
## 

Remember that not all the data should be interpreted as purely numerical. There may be conclusions you can draw by coincidence if you incorrectly assume data types! For example:

  • longitude and latitude should be interpreted as geospatial coordinates

  • payment_type is a discrete category of payment types

  • trip_distance is non-linear (not a straight line from A to B), but we have no further data on it. It is also in miles.

  • To avoid misinterpreting the attributes, refer to the data dictionary provided on the TLC website.

Scatterplot plots for Fare / Distance

ggplot(df) +
  aes(x=fare_amount,y=trip_distance) +
  geom_point()+
  theme_bw()

General Inference:

  • We can visually see that the relationship is relatively linear as you’d expect (more distance generally means more money)
  • A fair number of outliers, notably around the 0 distance axis and 0 cost axis. Why might this be the case?
  • Why are there negative values?

Histgram plots for Fare amount

ggplot(df)+
  geom_histogram(aes(x=fare_amount),bins =30, fill="darkblue") +
  theme_bw()

  • I guess we can kind of see most fares are between 0 - 100.
  • Hard to tell where the main distribution is spread around.
  • We can enhance this using a log transformation.
df$transform_fare_amount <- log(df$fare_amount)
ggplot(df)+
  geom_histogram(aes(x=transform_fare_amount),bins = 30, fill="darkblue") +
  theme_bw()

  • Take a log transformation to visually see the distribution

Do a correlation check.

cor(df$trip_distance,df$fare_amount,method = "pearson")
## [1] 0.899978
library(pheatmap)
#remove characters
pheatmap(cor(df[,-c(1:3,9)]),
         cluster_rows = FALSE,
         cluster_cols = FALSE)

  • trip_distance highly correlates with high tips, tolls and overall trip amount
  • payment_type seems to have some form of negative correlation with tip_amount. Must be careful as this is a discrete category.
cors =c("passenger_count", "trip_distance", "fare_amount", "extra", 
 "mta_tax", "tip_amount", "tolls_amount", "improvement_surcharge", "total_amount")
cor(df[,cors])
##                       passenger_count trip_distance   fare_amount        extra
## passenger_count          1.0000000000   0.001658362  0.0009505592 -0.003093358
## trip_distance            0.0016583622   1.000000000  0.8999780236 -0.223323624
## fare_amount              0.0009505592   0.899978024  1.0000000000 -0.207022734
## extra                   -0.0030933583  -0.223323624 -0.2070227338  1.000000000
## mta_tax                  0.0131316387  -0.027817098 -0.1282543385  0.130305153
## tip_amount              -0.0111149872   0.530623359  0.5573861481 -0.100025231
## tolls_amount             0.0065090920   0.572059235  0.5307124782 -0.157824284
## improvement_surcharge    0.0040463644   0.015943787  0.0489414626  0.095486378
## total_amount            -0.0007104667   0.898539215  0.9824415224 -0.186521180
##                           mta_tax  tip_amount tolls_amount
## passenger_count        0.01313164 -0.01111499  0.006509092
## trip_distance         -0.02781710  0.53062336  0.572059235
## fare_amount           -0.12825434  0.55738615  0.530712478
## extra                  0.13030515 -0.10002523 -0.157824284
## mta_tax                1.00000000 -0.06226707 -0.094005581
## tip_amount            -0.06226707  1.00000000  0.359383270
## tolls_amount          -0.09400558  0.35938327  1.000000000
## improvement_surcharge  0.59152689  0.01722265  0.004483847
## total_amount          -0.12168476  0.68568923  0.603590674
##                       improvement_surcharge  total_amount
## passenger_count                 0.004046364 -0.0007104667
## trip_distance                   0.015943787  0.8985392152
## fare_amount                     0.048941463  0.9824415224
## extra                           0.095486378 -0.1865211796
## mta_tax                         0.591526892 -0.1216847619
## tip_amount                      0.017222652  0.6856892276
## tolls_amount                    0.004483847  0.6035906745
## improvement_surcharge           1.000000000  0.0486015276
## total_amount                    0.048601528  1.0000000000
pheatmap(cor(df[,cors]))

Other plots to present correlations.

library(corrplot)
## corrplot 0.90 loaded
corrplot(cor(df[,cors]), method = 'number')

corrplot(cor(df[,cors]), method = 'color', order = 'alphabet') 

corrplot(cor(df[,cors]), method = 'color', order = 'AOE')  # after 'AOE' reorder

More references to draw cor plots.

Study Payment type and Distance

library(forcats)
df = df %>% 
  mutate(payment_type = as_factor(payment_type)) %>% 
  mutate(payment_type = recode(payment_type,`1` = "Credit Card", `2` = "Cash", `3` = "No charge", `4` = "Dispute"))
# Check category frequencies
fct_count(df$payment_type)
## # A tibble: 4 x 2
##   f               n
##   <fct>       <int>
## 1 Credit Card 66601
## 2 Cash        32955
## 3 No charge     336
## 4 Dispute       108
#Collapse Cash/No charge/Dispute
df = df %>% 
  mutate(payment_type = fct_collapse(payment_type, 
                                     Credit_Card = c("Credit Card"), 
                                     Cash_Or_Others=c("Cash","No charge","Dispute")))
ggplot(df, aes(payment_type)) + 
  geom_bar(fill=c("pink","orange")) + 
  ylab("Frequency") + 
  ggtitle("Barplot of Payment Methods")+ 
  theme_bw()

ggplot(df) +
  geom_histogram(aes(x=trip_distance, fill=payment_type))+ 
  theme_bw()

ggplot(df) +
  geom_histogram(aes(x=trip_distance, fill=payment_type),
                 breaks=seq(0,20, by=1))+ 
  theme_bw()+
  scale_fill_brewer(palette = "Paired")

Use facets.

ggplot(df) +
  geom_histogram(aes(x=trip_distance, fill=payment_type),
                 breaks=seq(0,20, by=1))+ 
  theme_bw()+
  facet_grid(.~payment_type) +
  scale_fill_brewer(palette = "Paired")

Use Spark in R

See sparklyr tutorial.

SessionInfo

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
##  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
##  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] forcats_0.5.1      corrplot_0.90      pheatmap_1.0.12    leaflet_2.0.4.1   
##  [5] RColorBrewer_1.1-2 feather_0.3.5      tmaptools_3.1-1    tmap_3.3-2        
##  [9] ggmap_3.0.0        ggplot2_3.3.5      curl_4.3.2         sf_1.0-1          
## [13] dplyr_1.0.7       
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7            httr_1.4.2              tools_4.0.3            
##  [4] bslib_0.2.5.1           utf8_1.2.1              R6_2.5.0               
##  [7] KernSmooth_2.23-17      DBI_1.1.1               colorspace_2.0-2       
## [10] raster_3.4-13           withr_2.4.2             sp_1.4-5               
## [13] tidyselect_1.1.1        compiler_4.0.3          leafem_0.1.6           
## [16] cli_3.0.0               labeling_0.4.2          bookdown_0.22          
## [19] sass_0.4.0              scales_1.1.1            classInt_0.4-3         
## [22] hexbin_1.28.2           proxy_0.4-26            stringr_1.4.0          
## [25] digest_0.6.27           rmarkdown_2.9           base64enc_0.1-3        
## [28] dichromat_2.0-0         jpeg_0.1-8.1            pkgconfig_2.0.3        
## [31] htmltools_0.5.1.1       highr_0.9               htmlwidgets_1.5.3      
## [34] rlang_0.4.11            rstudioapi_0.13         jquerylib_0.1.4        
## [37] generics_0.1.0          farver_2.1.0            jsonlite_1.7.2         
## [40] crosstalk_1.1.1         magrittr_2.0.1          s2_1.0.6               
## [43] Rcpp_1.0.7              munsell_0.5.0           fansi_0.5.0            
## [46] abind_1.4-5             lifecycle_1.0.0         stringi_1.7.2          
## [49] leafsync_0.1.0          yaml_2.2.1              plyr_1.8.6             
## [52] grid_4.0.3              parallel_4.0.3          crayon_1.4.1           
## [55] lattice_0.20-41         stars_0.5-3             hms_1.1.0              
## [58] knitr_1.33              pillar_1.6.1            rjson_0.2.20           
## [61] codetools_0.2-16        wk_0.5.0                XML_3.99-0.6           
## [64] glue_1.4.2              evaluate_0.14           leaflet.providers_1.9.0
## [67] png_0.1-7               vctrs_0.3.8             rmdformats_1.0.2       
## [70] RgoogleMaps_1.4.5.3     gtable_0.3.0            purrr_0.3.4            
## [73] tidyr_1.1.3             assertthat_0.2.1        xfun_0.24              
## [76] lwgeom_0.2-6            e1071_1.7-7             class_7.3-17           
## [79] viridisLite_0.4.0       tibble_3.1.2            units_0.7-2            
## [82] ellipsis_0.3.2